-
Notifications
You must be signed in to change notification settings - Fork 1.9k
[Kernel]Remove two file IO in CRC loading by reusing log listing #4112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| // Must be final to be used in lambda | ||
| final AtomicBoolean hasReturnedAnElement = new AtomicBoolean(false); | ||
| final AtomicBoolean hasReturnedlogOrCheckPoint = new AtomicBoolean(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hasReturnedCommitOrCheckpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add a comment here. Something like: This variable is used to help determine if we should throw an error if the table history is not reconstructable. Only commit and checkpoint files are applicable.
kernel/kernel-api/src/main/java/io/delta/kernel/internal/DeltaLogActionUtils.java
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| hasReturnedAnElement.set(true); | ||
| // Only log and checkpoint could use to construct table state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can delete this comment line
kernel/kernel-api/src/main/java/io/delta/kernel/internal/snapshot/SnapshotManager.java
Outdated
Show resolved
Hide resolved
| newVersion, | ||
| deltasAfterCheckpoint, | ||
| latestCompleteCheckpointFileStatuses, | ||
| listedChecksumFileStatues.stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't it just be the last one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, updated
...el/kernel-defaults/src/test/scala/io/delta/kernel/defaults/LogReplayEngineMetricsSuite.scala
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/snapshot/LogSegment.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/LogReplay.java
Show resolved
Hide resolved
| + list.stream().map(FileStatus::toString).collect(Collectors.joining(",\n ")); | ||
| } | ||
|
|
||
| public Optional<FileStatus> getLatestChecksum() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and I think these needs method docs? This checksum might not be for the version that this LogSegment represents, right?
also ... suppose we have 10.checkpoint and 11.json and 12.json ...
what are the possible ranges of the version for which this CRC could be for? Could we have 8.crc? what are the guarantees of this version range?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For logSegment, we just have 8.crc, downstream(LogReplay) decides use it or not.
kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/LogReplay.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/LogReplay.java
Show resolved
Hide resolved
| /** Returns the crc info for the current snapshot if the checksum file is read */ | ||
| public Optional<CRCInfo> getCurrentCrcInfo() { | ||
| return currentCrcInfo; | ||
| return cachedCrcInfo.filter(crc -> crc.getVersion() == logSegment.getVersion()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also -- to clarify: the consumers of this API only ever want the CRCInfo if it was read during log replay? They never just want the CRCINfo for the latest version, even if it wasn't read during log replay? That semantic seems a bit odd to me -- can you please clarify / explain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the consumer just try to get crcInfo of the current version. We return if it is cached. Updating the cache is not necessary during log replay's construction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the consumer just try to get crcInfo of the current version.
But this API will only return it if it was read / cached?
My confusion is this:
- it seems like consumers want the CRC info for the current version. full stop. if it exists, they want it
- this API returns it if and only if it exists and we had to read it to update the snapshot hint
So, there seems to be a gap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the concern here is --- when version hint is used, then crcInfo will always empty. This was unfortunately the behavior before, although problematic.
I updated with a fix. (Previously it will cause CRC missing when snapshot hint is used, now it is getting worse as we load domain metadata, so agree to fix it along the way.)
Which Delta project/connector is this regarding?
Description
Incremental CRC loading for P&M was added in #4077.
For snapshot N, we try to read N.crc, if N.crc missing, we list CRC files and find the one up to N-100. These two file listing operation could be actually avoided, because we have already listed files under _delta_log folder.
This PR collects latest CRC in the first file listing and remove two unnecessary IO.
How was this patch tested?
LogReplay test
Does this PR introduce any user-facing changes?
No